Weakly Supervised Representation Learning for Audio-Visual Scene Analysis
Authors
Abstract
Related Papers
Weakly Supervised PatchNets: Learning Aggregated Patch Descriptors for Scene Recognition
In this paper, we propose a hybrid representation, which leverages the great discriminative capacity of CNNs and the efficiency of a descriptor encoding scheme for scene recognition. We make three main contributions. First, we train an end-to-end PatchNet in a weakly supervised manner, in order to extract the discriminative deep descriptors of local patches. Second, we design a novel VSAD encoding ap...
Scene Parsing by Weakly Supervised Learning with Image Descriptions
This paper investigates a fundamental problem of scene understanding: how to parse a scene image into a structured configuration (i.e., a semantic object hierarchy with object interaction relations). We propose a deep architecture consisting of two networks: i) a convolutional neural network (CNN) extracting the image representation for pixel-wise object labeling and ii) a recursive neural netw...
Weakly-Supervised Video Scene Co-parsing
In this paper, we propose a scene co-parsing framework to assign pixel-wise semantic labels in weakly-labeled videos, i.e., only video-level category labels are given. To exploit rich semantic information, we first collect all videos that share the same video-level labels and segment them into supervoxels. We then select representative supervoxels for each category via a supervoxel ranking proce...
Weakly-supervised Dictionary Learning
We present a probabilistic modeling and inference framework for discriminative analysis dictionary learning under a weak supervision setting. Dictionary learning approaches have been widely used for tasks such as low-level signal denoising and restoration as well as high-level classification tasks, which can be applied to audio and image analysis. Synthesis dictionary learning aims at jointly l...
Multimodal Visual Concept Learning with Weakly Supervised Techniques
Despite the availability of a huge amount of video data accompanied by descriptive texts, it is not always easy to exploit the information contained in natural language in order to automatically recognize video concepts. Towards this goal, in this paper we use textual cues as means of supervision, introducing two weakly supervised techniques that extend the Multiple Instance Learning (MIL) fram...
Journal
Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
Year: 2020
ISSN: 2329-9290, 2329-9304
DOI: 10.1109/taslp.2019.2957889